Introduction
What is an Autoencoder?
Variations of autoencoder models
Autoencoder versus PCA
Applications of Autoencoders
Implementation: Soccer Player Data
In this example, we use the FIFA 23 soccer player data from Kaggle [1]. FIFA 23 is one of the most popular soccer video games. The data contains the following information about all the professional soccer players in FIFA 23:
- Player characteristics: Name, Age, Height, Weight, Club, etc.
- Player skill measures such as Crossing, Finishing, Dribbling, LongPassing, etc.
- Player position and corresponding rating in the game.
For more details on this data, please refer to SOFIFA and Kaggle. The data contains 90 variables, 42 of which are skill measures. We use PCA and an autoencoder to reduce the dimensionality of the data and to visualize it in two or three dimensions, colored by player position. We divide the positions into four categories:
- Forward (FWD): ST, LW, LF, CF, RF, RW
- Midfielder (MID): CAM, LM, CM, RM, CDM
- Defender (DEF): LWB, RWB, LB, CB, RB
- Goalkeeper (GK)
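The grouping above can be written as a simple lookup table. The following is a minimal base R sketch, not the author's code: the helper name `group_position` and the toy input vector are made up for illustration, and in the real data the position codes would come from a column of the player table.

```r
# Lookup table from specific position code to broad category,
# copied from the four categories listed above.
position_groups <- c(
  ST = "FWD", LW = "FWD", LF = "FWD", CF = "FWD", RF = "FWD", RW = "FWD",
  CAM = "MID", LM = "MID", CM = "MID", RM = "MID", CDM = "MID",
  LWB = "DEF", RWB = "DEF", LB = "DEF", CB = "DEF", RB = "DEF",
  GK = "GK"
)

# Hypothetical helper: maps position codes to categories.
# Codes that are not in the table come back as NA.
group_position <- function(pos) {
  unname(position_groups[pos])
}

# Toy input (illustrative only).
example_positions <- c("ST", "CDM", "RB", "GK", "CM")
group_position(example_positions)
```

Returning `NA` for unknown codes makes unexpected position labels easy to spot before plotting.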
PCA
We start with PCA because it is less complex than the autoencoder. We first divide the dataset into a training set (80%) and a test set (20%). For PCA we use only the training set; the autoencoder in the next section uses both the training and test sets. The following figure shows the training set projected onto the first two principal components, colored by player position.
We can clearly see that one position, GK (blue), is well separated from the other positions. However, the Midfielder and Forward positions are not well separated. We therefore remove GK and run PCA on the reduced dataset, and we compare PCA and the autoencoder on this dataset to find out which one better separates the three remaining positions in two or three dimensions.
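The split and the PCA step are not shown in the text, so here is a minimal sketch of how they might look in base R. It runs on synthetic data rather than the FIFA table. The names `ids_train`, `x_train`, and `x_test` mirror those used in the autoencoder code later in this section, but how the author actually built them is an assumption, as are the seed and the choice to center and scale.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the skill measures: 200 "players", 6 numeric skills.
skills <- matrix(rnorm(200 * 6), nrow = 200,
                 dimnames = list(NULL, paste0("skill", 1:6)))

# 80/20 split by row index.
n <- nrow(skills)
ids_train <- sample(seq_len(n), size = floor(0.8 * n))
x_train <- skills[ids_train, , drop = FALSE]
x_test  <- skills[-ids_train, , drop = FALSE]

# PCA on the training set only; centering and scaling are assumptions here.
pca <- prcomp(x_train, center = TRUE, scale. = TRUE)

# Scores on the first two components, ready for a 2-D scatter plot.
pc_scores <- pca$x[, 1:2]
dim(pc_scores)
```

Coloring a scatter plot of `pc_scores` by position category would give a figure of the kind described above.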
We then plot the cumulative explained variance by component in the following figure. The first two components explain 52% of the variation in the data, and the first three explain 60.8%. Although the first two or three components explain a fairly large share of the variation, we still lose a lot of information and cannot separate the Midfielder and Forward positions well.
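For readers who want to reproduce this kind of curve, the cumulative explained variance can be computed from the component standard deviations that `prcomp()` returns. The sketch below uses synthetic correlated data, so its numbers will not match the 52% and 60.8% reported above; the data-generating step and the seed are purely illustrative.

```r
set.seed(2)  # arbitrary seed for the synthetic data

# Synthetic data: 2 latent factors plus 4 noisy linear mixtures of them.
latent <- matrix(rnorm(300 * 2), ncol = 2)
mixing <- matrix(rnorm(2 * 4), nrow = 2)
x <- cbind(latent, latent %*% mixing + 0.3 * matrix(rnorm(300 * 4), ncol = 4))

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Each component's share of the total variance, then the running total.
explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(explained)

round(cum_explained, 3)
```

Passing `cum_explained` to `plot()` with `type = "b"` draws the curve; the entries at positions 2 and 3 correspond to the two- and three-component figures quoted in the text.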
Autoencoder
Next we fit an autoencoder to this dataset. The autoencoder is built with the Keras package [2] in R. We consider only three hidden layers here: the first is an encoder layer with 12 nodes, the second is the bottleneck with 3 nodes, and the last is a decoder layer with 12 nodes. The bottleneck layer has a lower dimension than the input and output layers, so it compresses the input data and represents it in a lower-dimensional space.
library(keras)

# Encoder (12 units) -> bottleneck (3 units) -> decoder (12 units) -> linear output layer
model <- keras_model_sequential() %>%
  layer_dense(units = 12, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dense(units = 3, activation = "relu", name = "bottleneck") %>%
  layer_dense(units = 12, activation = "relu") %>%
  layer_dense(units = ncol(x_train))

# The target equals the input, so the loss measures reconstruction error
model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)

history <- model %>% fit(
  x = x_train,
  y = x_train,
  epochs = 100,
  batch_size = 32,
  validation_data = list(x_test, x_test)
)

plot(history)

Figure: Autoencoder fitting results for train and test data
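The loss tracked during fitting is the mean squared error between each input row and its reconstruction. As a rough illustration of that quantity, here is a base R sketch with a hypothetical helper, `reconstruction_mse`, applied to two small made-up matrices. In the real workflow `x_hat` would presumably come from `predict(model, x_test)`; that step is not run here.

```r
# Hypothetical helper: mean squared reconstruction error,
# both overall and per observation (row).
reconstruction_mse <- function(x, x_hat) {
  stopifnot(identical(dim(x), dim(x_hat)))
  sq_err <- (x - x_hat)^2
  list(overall = mean(sq_err), per_row = rowMeans(sq_err))
}

# Made-up input and "reconstruction" (matrices are filled column by column).
x     <- matrix(c(1, 2, 3, 4), nrow = 2)  # rows: (1, 3) and (2, 4)
x_hat <- matrix(c(1, 2, 2, 6), nrow = 2)  # rows: (1, 2) and (2, 6)

err <- reconstruction_mse(x, x_hat)
err$overall  # 1.25
err$per_row  # 0.5 2.0
```

The per-row values show which observations the bottleneck reconstructs poorly, which the single overall loss hides.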
library(ggplot2)
library(plotly)

# Extract the activations of the bottleneck layer
intermediate_layer_model <- keras_model(inputs = model$input, outputs = get_layer(model, "bottleneck")$output)
intermediate_output <- predict(intermediate_layer_model, x_train)

# intermediate_output has 3 columns, one per node in the bottleneck layer.
# Build a data frame for plotting.
aedf <- data.frame(node1 = intermediate_output[, 1],
                   node2 = intermediate_output[, 2],
                   node3 = intermediate_output[, 3])

# Two dimensions
ggplot(aedf, aes(x = node1, y = node2, col = df.rmgk[ids_train, ]$BP)) +
  geom_point() +
  scale_color_discrete(name = "location") +
  ggtitle("Autoencoder two dimension")

# Three dimensions
plot_ly(aedf, x = ~node1, y = ~node2, z = ~node3, color = df.rmgk[ids_train, ]$BP) %>%
  add_markers()

Conclusion
Reference
[1] https://www.kaggle.com/datasets/cashncarry/fifa-23-complete-player-dataset
[2] Allaire, J. J., & Chollet, F. (2020). keras: R Interface to 'Keras' [Computer software]. R package version 2.3.0.0. https://CRAN.R-project.org/package=keras